The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

Authors

Thierry Chanier

Céline Poudat

Benoît Sagot

Georges Antoniadis

Ciara Wigham

Linda Hriba

Julien Longhi

Djamé Seddah

Abstract

The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunications, covering monomodal and multimodal as well as synchronous and asynchronous communications. Corpora are assembled in a standard format, that of the TEI (Text Encoding Initiative). This implies extending, through a European endeavor, the TEI model of text in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to the decisions made concerning which annotations to make for which units, and describe the processing pipeline for adding them. All CoMeRe corpora are verified through a staged quality control process designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight the issues raised and the decisions made concerning the OpenData perspective.

Similar resources

Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics

The CoMeRe project aims to build a kernel corpus of different computer-mediated communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunications, as well as mono and multimodal, and synchronous and asynchronous communications. Corpora are assembled using a standard, thanks to the Text Encodi...

EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a ...

Can Scaffolding Mechanisms of Structuring and Problematizing Facilitate the Transfer of Genre-based Knowledge to Another Discourse Mode?

A pivotal issue in research on writing concerns whether the knowledge of how genres are constructed and learned in one discipline/genre can be transferred to other contexts, genres, and disciplines. Yet, studies conducted so far have not presented a unified and complete view of how various writing instructional techniques can result in transferability. This study examined the effect of structur...

DeRiK: A German reference corpus of computer-mediated communication

The paper describes an ongoing project that aims at building a reference corpus of German computer-mediated communication (CMC) as a new component of an already existing reference corpus of written contemporary German. The ‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’ (DeRiK) shall include data from the most prominent CMC genres amongst German Internet users and, thus, close a ...

Annotating Football Matches: Influence of the Source Medium on Manual Annotation

In this paper, we present an annotation campaign of football (soccer) matches, from a heterogeneous text corpus of both match minutes and video commentary transcripts, in French. The data, annotations and evaluation process are detailed, and the quality of the annotated corpus is discussed. In particular, we propose a new technique to better estimate the annotator agreement when few elements of...

Journal:

JLCL

Volume 29, issue

Pages -

Publication date: 2014
